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Abstract 

Parallel  algorithms  for  sparse  matrix-matrix  multiplication  typically  spend  most  of  their  time  on 
inter-processor  communication  rather  than  on  computation,  and  hardware  trends  predict  the  relative  cost 
of  communication  will  only  increase.  Thus,  sparse  matrix  multiplication  algorithms  must  minimize 
communication  costs  in  order  to  scale  to  large  processor  counts. 

In  this  paper,  we  consider  multiplying  sparse  matrices  corresponding  to  Erdos-Renyi  random  graphs 
on  distributed-memory  parallel  machines.  We  prove  a  new  lower  bound  on  the  expected  communication 
cost  for  a  wide  class  of  algorithms.  Our  analysis  of  existing  algorithms  shows  that,  while  some  are 
optimal  for  a  limited  range  of  matrix  density  and  number  of  processors,  none  is  optimal  in  general.  We 
obtain  two  new  parallel  algorithms  and  prove  that  they  match  the  expected  communication  cost  lower 
bound,  and  hence  they  are  optimal. 


*  We  acknowledge  funding  from  Microsoft  (Award  #024263)  and  Intel  (Award  #024894),  and  matching  funding  by  U.C.  Discov¬ 
ery  (Award  #DIG07-10227).  Additional  support  comes  from  ParLab  affiliates  National  Instruments,  Nokia,  NVIDIA,  Oracle  and 
Samsung,  as  well  as  Math  Works.  Research  is  also  supported  by  DOE  grants  DE-SC0004938,  DE-SC0005136,  DE-SC0003959, 
DE-SC0008700,  and  AC02-05CH1 1231,  and  DARPA  grant  HR001 1-12-2-0016,  and  grant  1045/09  from  the  Israel  Science  Foun¬ 
dation  (founded  by  the  Israel  Academy  of  Sciences  and  Humanities),  and  grant  2010231  from  the  US-Israel  Bi-National  Science 
Foundation. 


1  Introduction 


Computing  the  product  of  two  sparse  matrices  is  a  fundamental  problem  in  combinatorial  and  scientific  com¬ 
puting.  Generalized  sparse  matrix-matrix  multiplication  is  used  as  a  subroutine  in  algebraic  multigrid  [5], 
graph  clustering  [27]  and  contraction  [15],  quantum  chemistry  [28],  and  parsing  context-free  languages  [22], 
Large-scale  data  and  computation  necessitates  the  use  of  parallel  computing  where  communication  costs 
quickly  become  the  bottleneck.  Existing  parallel  algorithms  for  multiplying  sparse  matrices  perform  reason¬ 
ably  well  in  practice  for  limited  processor  counts,  but  their  scaling  is  impaired  by  increased  communication 
costs  at  high  concurrency. 

Achieving  scalability  for  parallel  algorithms  for  sparse  matrix  problems  is  challenging  because  the  com¬ 
putations  tend  not  to  have  the  surface  to  volume  ratio  (or  potential  for  data  re-use)  that  is  common  in  dense 
matrix  problems.  Further,  the  performance  of  sparse  algorithms  is  often  highly  dependent  on  the  sparsity 
structure  of  the  input  matrices.  We  show  in  this  paper  that  existing  algorithms  for  sparse  matrix-matrix 
multiplication  are  sub-optimal  in  their  communication  costs,  and  we  obtain  new  algorithms  which  are  com¬ 
munication  optimal,  communicating  less  than  the  previous  algorithms  and  matching  new  lower  bounds. 

Our  lower  bounds  require  two  important  assumptions:  (1)  the  sparsity  of  the  input  matrices  is  ran¬ 
dom,  corresponding  to  Erdos-Renyi  random  graphs  (see  Definition  2.1),  and  (2)  the  algorithm  is  sparsity- 
independent,  where  the  computation  is  statically  partitioned  to  processors  independent  of  the  sparsity  struc¬ 
ture  of  the  input  matrices  (see  Definition  2.5).  The  second  assumption  applies  to  nearly  all  existing  al¬ 
gorithms  for  general  sparse  matrix-matrix  multiplication.  While  a  priori  knowledge  of  sparsity  structure 
can  certainly  reduce  communication  for  many  important  classes  of  inputs,  dynamically  determining  and 
efficiently  exploiting  the  structure  of  general  input  matrices  is  a  challenging  problem.  In  fact,  a  common 
technique  of  current  library  implementations  is  to  randomly  permute  rows  and  columns  of  the  input  matrices 
in  an  attempt  to  destroy  their  structure  and  improve  computational  load  balance  [7,  9].  Because  the  input 
matrices  are  random,  our  analyses  are  in  terms  of  expected  communication  costs. 

We  make  three  main  contributions  in  this  paper. 

1.  We  prove  new  communication  lower  bounds.  While  there  is  a  previous  lower  bound  which  applies  to 
sparse  matrix-matrix  multiplication  [2],  it  is  too  low  to  be  attainable.  We  use  a  similar  proof  technique 
but  devise  a  tighter  lower  bound  on  the  communication  costs  in  expectation  for  random  input  matrices 
which  is  independent  of  the  local  memory  size  of  each  processor.  See  Section  3  for  details. 

2.  We  obtain  two  new  communication-optimal  algorithms.  Our  3D  iterative  and  recursive  algorithms 
(see  Sections  4.3  and  4.4)  are  adaptations  of  dense  ones  [13,  25],  though  an  important  distinction  is  that 
the  sparse  algorithms  do  not  require  extra  local  memory  to  minimize  communication.  We  also  optimize 
an  existing  algorithm.  Sparse  SUMMA,  to  be  communication-optimal  in  some  cases. 

3.  We  provide  a  unified  communication  analysis  of  existing  and  new  algorithms.  See  Table  1  for  a 
summary  of  the  expected  communication  costs  of  the  algorithms  applied  to  random  input  matrices.  See 
Section  4  for  a  description  of  the  algorithms  and  their  communication  analysis. 

There  are  many  extensions  of  the  algorithms  and  analysis  presented  in  this  paper.  The  new  algorithms 
have  not  yet  been  benchmarked  and  compared  against  previous  algorithms.  We  plan  to  extend  the  perfor¬ 
mance  studies  of  [9]  to  include  all  of  the  algorithms  considered  here.  Additionally,  we  hope  that  our  analysis 
can  be  extended  to  many  more  types  of  input  matrices.  We  are  especially  interested  in  sparsity  structures 
corresponding  to  applications  which  are  currently  bottlenecked  by  sparse  matrix-matrix  multiplication,  such 
as  the  triple  product  computation  within  algebraic  multigrid.  In  the  case  of  matrix  multiplication,  we  have 
shown  how  to  apply  ideas  from  dense  algorithms  to  obtain  communication-optimal  sparse  algorithms.  Per¬ 
haps  similar  adaptions  can  be  made  for  other  matrix  computations  such  as  direct  factorizations. 
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2  Preliminaries 


Throughout  the  paper,  we  are  interested  in  the  computation  A  B  =  C.  For  sparse  matrix  indexing,  we  use 
the  colon  notation,  where  A (:,*)  denotes  the  /'th  column,  A (i, :)  denotes  the  /th  row,  and  A (i,j)  denotes 
the  element  at  the  (i,  j)th  position  of  matrix  A.  We  use  flops  to  denote  the  number  of  nonzero  arithmetic 
operations  required  when  computing  the  product  of  matrices  A  and  B  and  nnz(-)  to  denote  the  number  of 
nonzeros  in  a  matrix  or  submatrix. 

We  consider  the  case  where  A  and  B  are  n  X  n  ER(d)  matrices: 

Definition  2.1  An  ER(d)  matrix  is  an  adjacency  matrix  of  an  Erdos-Renyi  graph  with  parameters  n  and  d/n. 
That  is,  an  ER(d)  matrix  is  a  square  matrix  of  dimension  n  where  each  entry  is  nonzero  with  probability  d/n. 
We  assume  d  <C  \Jn. 

In  this  case,  the  following  facts  will  be  useful  for  our  analysis. 

Lemma  2.2  Let  A  and  B  be  n  x  n  ER(d)  matrices.  Then 

(a)  the  expected  number  of  nonzeros  in  A  and  in  B  is  dn, 

(b)  the  expected  number  of  scalar  multiplications  in  A  B  is  d2n,  and 

(c)  the  expected  number  of  nonzeros  in  C  is  d2n(  1  —  o(l)). 

Proof.  Since  each  entry  of  A  and  B  is  nonzero  with  probability  d/n,  the  expected  number  of  nonzeros  in 
each  matrix  is  n2(d/n)  =  dn.  For  each  of  the  possible  n3  scalar  multiplications  in  A  •  B,  the  computation  is 
required  only  if  both  corresponding  entries  of  A  and  B  are  nonzero,  which  are  independent  events.  Thus  the 
probability  that  any  multiplication  is  required  is  d2/n2,  and  the  expected  number  of  scalar  multiplications 
is  d2n.  Finally,  an  entry  of  C  =  A  ■  B  is  zero  only  if  all  n  possible  scalar  multiplications  corresponding 
to  it  are  zero.  Since  the  probability  that  a  possible  scalar  multiplication  is  zero  is  (1  —  d2 /n2)  and  the  n 
possible  scalar  multiplications  corresponding  to  a  single  output  entry  are  independent,  the  probability  that 
an  entry  of  C  is  zero  is  (1  —  d2 /n2)n  =  1  —  d2 /n  +  ()(d4 /n2).  Thus  the  expected  number  of  nonzeros  of 
C  is  n2(d2/n  —  0(d4/n2)  =  d2n{  1  —  o(l)),  since  we  assume  d  <C  y/n.  □ 

Definition  2.3  The  computation  cube  of  square  nxn  matrix  multiplication  is  annxnxn  lattice  where  the 
voxel  at  location  (i,  j,  k )  corresponds  to  the  scalar  multiplication  A (i,  k )  •  B(fc,  j ).  We  say  a  voxel  ( i ,  j,  k ) 
is  nonzero  if  for  given  input  matrices  A  and  B,  both  A (i,k)  and  B  ( /;: ,  j )  are  nonzero. 

Given  a  set  of  voxels  V,  the  projections  of  the  set  onto  three  orthogonal  faces  corresponds  to  the  input 
elements  of  A  and  B  necessary  to  perform  the  multiplications  and  the  output  elements  of  C  which  the 
products  must  update.  The  computation  cube  and  this  relationship  of  voxels  to  input  and  output  matrix 
elements  is  shown  in  Figure  1.  The  following  lemma  relates  the  volume  of  V  to  its  projections: 

Lemma  2.4  [  10]  Let  V  be  a  finite  set  of  lattice  points  in  R3,  i.e.,  points  (x.  y,  z )  with  integer  coordi¬ 
nates.  Let  Vx  be  the  projection  of  V  in  the  x-direction,  i.e.,  all  points  (y,  z )  such  that  there  exists  an 
x  so  that  (x,  y ,  z)  E  V.  Define  Vy  and  Vz  similarly.  Let  \  ■  \  denote  the  cardinality  of  a  set.  Then 
\V\  <  y/\Vx\  -\Vy\  •  \VZ\. 

Definition  2.5  A  sparsity-independent  parallel  algorithm  for  sparse  matrix-matrix  multiplication  is  one 
in  which  the  assignment  of  entries  of  the  input  and  output  matrices  to  processors  and  the  assignment  of 
computation  voxels  to  processors  is  independent  of  the  sparsity  pattern  of  the  input  (or  output )  matrices.  If 
an  assigned  matrix  entry  is  zero,  the  processor  need  not  store  it;  if  an  assigned  voxel  is  zero,  the  processor 
need  to  perform  the  computation. 

Our  lower  bound  argument  in  Section  3  will  apply  to  all  sparsity-independent  algorithms.  However,  we 
will  analyze  a  more  restricted  set  of  algorithms  in  Section  4,  those  that  assign  contiguous  brick-shaped  sets 
of  voxels  to  each  processor. 
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2.1  Communication  Model 


We  use  the  parallel  distributed-memory  communication  model  of  Ballard  et  al.  [2],  In  this  model,  every 
processor  has  a  local  memory  of  size  M  words  which  is  large  enough  to  store  one  copy  of  the  output  matrix 
C  distributed  across  the  processors:  M  =  f l(d2n/P).  We  are  interested  in  the  communication  that  occurs 
via  message  passing  between  processors,  and  we  model  the  cost  of  a  message  of  w  words  as  a  +  j3w,  so  that 
a  is  the  cost  per  message  and  f3  is  the  cost  per  word.  To  estimate  the  running  time  of  a  parallel  algorithm,  we 
count  the  cost  of  communication  in  terms  of  number  of  words  W  ( bandwidth  cost)  and  number  of  messages 
S  ( latency  cost)  along  the  critical  path  of  the  algorithm.  That  is,  if  two  pair's  of  processors  communicate 
messages  of  the  same  size  simultaneously,  we  count  that  as  the  cost  of  one  message.  We  assume  a  single 
processor  can  communicate  only  one  message  to  one  processor  at  a  time.  In  this  model,  we  do  not  consider 
contention  or  the  number  of  hops  a  message  travels;  we  assume  the  network  has  all-to-all  connectivity. 

2.2  All-to-all  Communication 

Several  of  the  algorithms  we  discuss  make  use  of  all-to-all  communication.  If  each  processor  needs  to 
send  b  words  to  each  other  processor  (so  each  processor  needs  to  send  a  total  of  b(P  —  1)  words),  the 
bandwidth  lower  bound  is  W  =  Q(bP)  and  the  latency  lower  bound  is  S  =  Q(log  P ).  Each  of  these  bounds 
is  attainable,  but  it  has  been  shown  that  they  are  not  simultaneously  attainable  (see  Theorem  2.9  of  [6]). 
Depending  on  the  relative  costs  of  bandwidth  and  latency,  one  may  wish  to  use  the  point-to-point  algorithm 
(each  processor  sends  data  directly  to  each  other  processor)  which  incurs  costs  of  W  =  0(bP),  S  =  0{P) 
or  the  bit-fixing  algorithm  (each  message  of  b  words  is  sent  by  the  bit-fixing  routing  algorithm)  which  incurs 
costs  of  W  =  0(bP  log  P),  S  =  0(log  P). 


3  Lower  Bounds 


The  general  lower  bounds  for  linear  algebra  [2]  apply  to  our  case  and  give 


W  =  ft 


(  d2n  \ 

\p^m) ' 


(1) 


This  bound  is  highest  when  M  takes  its  minimum  value  d2n/P,  in  which  case  they  become  W  =  fl(\/ d2n/P). 
In  this  section  we  improve  these  lower  bounds  by  a  factor  of  y/n  ■  max{l,  d/  y  P).  For  larger  values  of  M, 
the  lower  bound  in  Equation  1  becomes  weaker,  whereas  our  new  bound  does  not,  and  the  improvement 
factor  increases  to  \fM  ■  max{l,  y1'  P/d\.  The  previous  memory-independent  lower  bound  [4]  reduces  to 
the  trivial  bound  W  =  D(0). 


Theorem  3.1  A  sparsity-independent  sparse  matrix  multiplication  algorithm  with  load-balanced  input  and 
output  has  expected  communication  cost  lower  bounded  by 


w  =  n 


^rnin 


dn  d2n'l\ 


for  Kllid)  input  matrices  on  P  processors. 


Proof.  Consider  the  n3  voxels  that  correspond  to  potential  scalar  multiplications  A(i,  k)  ■  B(7>;,  j).  A 
sparsity-independent  algorithm  gives  a  partitioning  of  these  multiplications  among  the  P  processors.  Let 
V  be  the  largest  set  of  voxels  assigned  to  a  processor,  so  |V|  >  jy.  For  each  i,j,  let  -  be  the  number 
of  values  of  k  such  that  (i,j,  k)  G  V,  see  Figure  3.  We  count  how  many  of  the  voxels  in  V  correspond  to 
ifj  <  j  and  divide  into  two  cases. 
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Case  1:  At  least  ^  voxels  of  V  correspond  to  <  j.  Let  V'  be  these  voxels,  so  \V'\  >  jp. 
We  will  analyze  the  communication  cost  corresponding  to  the  computation  of  V'  and  get  a  bound  on  the 
number  of  products  computed  by  this  processor  that  must  be  sent  to  other  processors.  Since  the  output  is 
load  balanced  and  the  algorithm  is  sparsity-independent,  the  processor  that  computes  V'  is  allowed  to  store 
only  a  particular  set  of  ,Jp  entries  of  C  in  the  output  data  layout.  Since  every  voxel  in  V'  corresponds  to 
an  ifj  <  the  '-p  output  elements  stored  by  the  processor  correspond  to  at  most  pp  voxels  in  V' ,  which 
is  at  most  half  of  |  V'7 1 .  All  of  the  nonzero  voxels  in  the  remainder  of  V'  contribute  to  entries  of  C  that 
must  be  sent  to  another  processor.  In  expectation,  this  is  at  least  pp  nonzero  voxels,  since  each  voxel  is 
nonzero  with  probability  ^ .  Moreover,  from  Lemma  2.2,  only  a  small  number  of  the  nonzero  entries  of 
C  have  contributions  from  more  than  one  voxel,  so  very  few  of  the  values  can  be  summed  before  being 
communicated.  The  expected  bandwidth  cost  is  then  bounded  by  W  =  kl(d2n/ P). 

Case  2:  Fewer  than  £jp  voxels  of  V  correspond  to  £d  <  j.  This  means  that  at  least  £jp  voxels  of  V 
correspond  to  >  j.  Let  V"  be  these  voxels,  so  \  V"\  >  We  will  analyze  the  communication  cost 
corresponding  to  the  computation  of  V"  and  get  a  lower  bound  on  the  amount  of  input  data  needed  by  this 
processor.  For  each  i,  k,  let  be  the  number  of  values  of  j  such  that  [i.j.  k)  £  V" .  Similarly,  for  each 
j,  k,  let  £p.  be  the  number  of  values  of  i  such  that  ( i,j ,  k)  £  V" .  Partition  V"  into  three  sets:  Vo  is  the  set  of 
voxels  that  correspond  to  I//  >  p  and  £pk  >  p,  Va  is  the  set  of  voxels  that  correspond  to  £-j,  <  and  Vb 

is  the  set  of  voxels  that  correspond  to  £?k  <  /  and  £]£.  >  /.  At  least  one  of  these  sets  has  at  least  voxels, 
and  we  divide  into  three  subcases. 

3 

Case  2a:  |Vo|  >  | p .  Let  pa,  Pb,  and  pc  be  the  sizes  of  the  projections  of  Vo  onto  A,  B,  and  C, 
respectively.  Lemma  2.4  implies  that  paPbPC  >  Vo | 2  =  Vppi-  The  assumptions  of  Case  2  implies 

PC  <  cj\-  Thus  paPb  >  drpp,  or  ma x{pa,Pb}  >  Jppp ■  Since  the  situation  is  symmetric  with  respect 

2 

to  A  and  B,  assume  without  loss  of  generality  that  A  has  the  larger  projection,  so  pa  >  Jfpp'  Since  the 
density  of  A  is  -,  this  means  that  the  expected  number  of  nonzeros  in  the  projection  of  Vo  onto  A  is  at 
least  Jppp  ■  Since  each  of  these  nonzeros  in  A  coiTesponds  to  a  ^ ,  it  is  needed  to  compute  Vo  with 

probability  at  least  1  —  (1  —  d/n)n^d  >  1  —  1/e.  Thus  in  expectation  a  constant  fraction  of  the  nonzeros  of 
A  in  the  projection  of  Vo  are  needed.  The  number  of  nonzeros  the  processor  holds  in  the  initial  data  layout 
is  pjr  in  expectation,  which  is  asymptotically  less  than  the  number  needed  for  the  computation.  Thus  we  get 
a  bandwidth  lower  bound  of  W  =  f l(dn/V~P). 

Case  2b:  \Va\  >  pp-  Each  voxel  in  Va  corresponds  to  In  this  case  we  are  able  to  bound  the 

re-use  of  entries  of  mA  to  get  a  lower  bound.  Count  how  many  entries  of  A  correspond  to  each  possible 
value  of  £fk,  1  <  r  <  /,  and  call  this  number  Nr.  Note  that  Ylt=i  r  •  Nr  =  \Va\-  Suppose  a  given  entry 
A (i,  k)  corresponds  to  £< |  =  r.  We  can  bound  the  probability  that  A(i,  k)  is  needed  by  the  processor  to 
compute  Va  as  a  function  f(r).  The  probability  that  both  A(i,  k)  is  needed  is  the  probability  than  A(i,  k ) 
is  nonzero  and  one  of  the  r  voxels  corresponding  to  A(i,  k)  in  Va  is  nonzero,  so 


f(r) 


rd2 
~  2 n2  ’ 


since  r  <  Thus  the  expected  number  of  nonzeros  of  A  that  are  needed  by  the  processor  is 


d/n 

r= 1 


J2  r  ■  Nr 


r=  1 


“  12P 


This  is  asymptotically  larger  than  the  number  of  nonzeros  the  processor  holds  at  the  beginning  of  the  com¬ 
putation,  so  we  get  a  bandwidth  lower  bound  of  W  =  ft(d2n/P). 
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Case  2c:  \Vb\  >  fp.  The  analysis  is  identical  to  the  previous  case,  except  we  look  at  the  number  of 
nonzeros  of  B  that  are  required. 

Since  an  algorithm  may  be  in  any  of  these  cases,  the  overall  lower  bound  is  the  minimum: 


W  =  n 


^rnin 


dn  d2n  )  \ 


□ 


4  Algorithms 

In  this  section  we  consider  algorithms  which  assign  contiguous  bricks  of  voxels  to  processors.  We  categorize 
these  algorithms  into  ID,  2D,  and  3D  algorithms,  as  shown  in  Figure  2.  If  we  consider  the  dimensions  of 
the  brick  of  voxels  assigned  to  each  processor,  ID  algorithms  correspond  to  bricks  with  two  dimensions  of 
length  n  (and  1  shorter),  2D  algorithms  correspond  to  bricks  with  one  dimension  of  length  n  (and  2  shorter), 
and  3D  algorithms  correspond  to  bricks  with  all  3  dimensions  shorter  than  n.  Table  1  provides  a  summary 
of  the  communication  costs  of  the  sparsity-independent  algorithms  we  consider. 

4.1  ID  Algorithms 

4.1.1  Naive  Block  Row  Algorithm 

The  naive  block  row  algorithm  [8]  distributes  the  input  and  output  matrices  to  processors  in  a  block  row 
fashion.  Then  in  order  for  processor  i  to  compute  the  vth  block  row,  it  needs  access  to  the  zth  block  row 
of  A  (which  it  already  owns),  and  potentially  all  of  B.  Thus,  we  can  allow  each  processor  to  compute  its 
block  row  of  C  by  leaving  A  and  C  stationary  and  cyclically  shifting  block  rows  of  B  around  a  ring  of  the 
processors.  This  algorithm  requires  P  stages,  with  each  processor  communicating  with  its  two  neighbors 
in  the  ring.  The  size  of  each  message  is  the  number  of  nonzeros  in  a  block  row  of  B,  which  is  expected 
to  be  dn/P  words.  Thus,  the  bandwidth  cost  of  the  block  row  algorithm  is  dn  and  the  latency  cost  is 
P.  An  analogous  block  column  algorithm  works  by  cyclically  shifting  block  columns  of  A  with  identical 
communication  costs. 

4.1.2  Improved  Block  Row  Algorithm 

The  communication  costs  of  the  block  row  algorithm  can  be  reduced  without  changing  the  assignment  of 
matrix  entries  or  voxels  to  processors  [12].  The  key  idea  is  for  each  processor  to  determine  exactly  which 
rows  of  B  it  needs  to  access  in  order  to  perform  its  computations.  For  example,  if  processor  i  owns  the  zth 
block  row  of  A,  A  $,  and  the  /th  subcolumn  of  A  *  contains  no  nonzeros,  then  processor  i  doesn’t  need  to 
access  the  jth  row  of  B.  Further,  since  the  height  of  a  subcolumn  is  n/P,  the  probability  that  the  subcolumn 
is  completely  empty  is 

Pr  [nnz(Ai(:,j))  =  0]  =  (l  -  ~  1  - 

assuming  d  <  P.  In  this  case,  the  expected  number  of  subcolumns  of  A*  which  have  at  least  one  nonzero  is 
dn / P.  Since  processor  i  needs  to  access  only  those  rows  of  B  which  correspond  to  nonzero  subcolumns  of 
A i,  and  because  the  expected  number  of  nonzeros  in  each  row  of  B  is  d,  the  expected  number  of  nonzeros 
of  B  that  processor  i  needs  to  access  is  d2n/P. 

Note  that  the  local  memory  of  each  processor  must  be  of  size  Q  (d2n/ P)  in  order  to  store  the  output 
matrix  C.  Thus,  it  is  possible  for  each  processor  to  gather  all  of  their  required  rows  of  B  at  once.  The 
improved  algorithm  consists  of  each  processor  determining  which  rows  it  needs,  requesting  those  rows 
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from  the  appropriate  processors,  and  then  sending  and  receiving  approximately  d  rows.  While  this  can  be 
implemented  in  various  ways,  the  bandwidth  cost  of  the  algorithm  is  at  least  D  (d2n/P)  and  if  point-to- 
point  communication  is  used,  the  latency  cost  is  at  least  fi(min {P,dn/P}).  The  block  column  algorithm 
can  be  improved  in  the  same  manner. 

4.1.3  Outer  Product  Algorithm 

Another  possible  ID  algorithm  is  to  partition  A  in  block  columns,  and  B  in  block  rows  [  9].  Without 
communication,  each  processor  locally  generates  an  n  x  n  sparse  matrix  of  rank  n/P,  and  processors 
combine  their-  results  to  produce  the  output  C.  Because  each  column  of  A  and  row  of  B  have  about  d 
nonzeros,  the  expected  number  of  nonzeros  in  the  locally  computed  output  is  d2n/P.  By  deciding  the 
distribution  of  C  to  processors  up  front,  each  processor  can  determine  where  to  send  each  of  its  computed 
nonzeros.  The  final  communication  pattern  is  realized  with  an  all-to-all  collective  in  which  each  processer 
sends  and  receives  0(d2n/P)  words.  Note  that  assuming  A  and  B  are  initially  distributed  to  processors  in 
different  ways  may  be  unrealistic;  however,  no  matter  how  they  are  initially  distributed,  A  and  B  can  be 
transformed  to  block  column  and  row  layouts  with  all-to-all  collectives  for  a  communication  cost  which  is 
dominated  by  the  final  communication  phase. 

To  avoid  the  all-to-all,  it  is  possible  to  compute  the  expected  number  of  blocks  of  the  output  which 
actually  contain  nonzeros;  the  best  distribution  of  C  is  2D,  in  which  case  the  expected  number  of  blocks  of 
C  you  need  to  communicate  is  min{P,  dn/yfP}.  Thus  for  P  >  (dn)2,/:\  the  outer  product  algorithm  can 
have  W  =  0(d2n/P)  and  S  =  0(dn/\fP). 

4.2  2D  Algorithms 

4.2.1  Sparse  SUMMA 

In  the  Sparse  SUMMA  algorithm  [8],  the  brick  of  voxels  assigned  to  a  processor  has  its  longest  dimension 
(of  length  n)  in  the  k  dimension.  For  each  output  entry  of  C  to  which  it  is  assigned,  the  processor  com¬ 
putes  all  the  nonzero  voxels  which  contribute  to  that  output  entry.  The  algorithm  has  a  bandwidth  cost  of 
0(dn/yfP )  and  a  latency  cost  of  0{yfP)  [8]. 

4.2.2  Improved  Sparse  SUMMA 

In  order  to  reduce  the  latency  cost  of  Sparse  SUMMA,  each  processor  can  gather  all  the  necessary  input 
data  up  front.  That  is,  each  processor  is  computing  a  product  of  a  block  row  of  A  with  a  block  column  of 
B,  so  if  it  gathers  all  the  nonzeros  in  those  regions  of  the  input  matrices,  it  can  compute  its  block  of  C  with 
no  more  communication.  Since  every  row  of  A  and  column  of  B  contain  about  d  nonzeros,  and  the  number 
of  rows  of  A  and  columns  of  B  in  a  block  is  n/y/P,  the  number  of  nonzeros  a  processor  must  gather  is 
0(dn/y/P).  Ifd  >  y/P,  then  the  memory  requirements  for  this  gather  operation  do  not  exceed  the  memory 
requirements  for  storing  the  block  of  the  output  matrix  C,  which  is  fl(d2n/P). 

The  global  communication  pattern  for  each  processor  to  gather  its  necessary  data  consists  of  allgather 
collectives  along  processor  columns  and  along  processor  grids.  The  bandwidth  cost  of  these  collectives  is 
0(dn/y/P),  which  is  the  same  as  the  standard  algorithm,  and  the  latency  cost  is  reduced  to  (9 (log  P).  To 
our  knowledge,  this  improvement  has  not  appeared  in  the  literature  before. 

We  might  also  consider  applying  the  optimization  that  improved  the  ID  block  row  (or  column)  algo¬ 
rithm.  Processor  (i,j)  would  need  to  gather  the  indices  of  the  nonzero  subcolumns  of  A,  and  the  nonzero 
subrows  of  B j.  This  requires  receiving  Vt{dn/y/P)  words,  and  so  it  cannot  reduce  the  communication  cost 
of  Sparse  SUMMA. 
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As  in  the  dense  case,  there  are  variants  on  the  sparse  SUMMA  algorithm  that  leave  one  of  the  input 
matrices  stationary,  rather  than  leaving  the  output  matrix  C  stationary  [17].  When  multiplying  ER(d)  matri¬ 
ces,  stationary  input  matrix  algorithms  require  more  communication  that  the  standard  approach  because  the 
global  data  involved  in  communicating  C  is  about  d2n,  while  the  global  data  involved  in  communicating  A 
and  B  is  only  dn. 

4.3  3D  Iterative  Algorithm 

In  this  section  we  present  a  new  3D  iterative  algorithm.  We  start  with  a  dense  version  of  the  algorithm  and 
apply  a  series  of  improvements  in  order  to  match  the  lower  bound. 

4.3.1  3D  Algorithms  for  Dense  Matrix  Multiplication 

The  term  “3D”  originates  from  dense  matrix  multiplication  algorithms  [1],  where  the  processors  are  orga¬ 
nized  in  a  3-dimensional  grid,  and  the  computational  cube  is  mapped  directly  onto  the  cube  of  processors. 
In  the  simplest  case,  the  processors  arc  arranged  in  a  yfP  x  X  yfP  grid.  Let  A  be  distributed  across  the 
P2/3  processors  along  one  face  of  the  cube  and  B  be  distributed  across  the  P2/3  processors  along  a  second 
face  of  the  cube.  Then  each  input  matrix  can  be  broadcast  through  the  cube  in  the  respective  dimensions 
so  that  every  processor  in  the  cube  owns  the  block  of  A  and  the  block  of  B  it  needs  to  compute  its  local 
multiplication.  After  the  computation,  the  matrix  C  can  be  computed  via  a  reduction  in  the  third  dimension 
of  the  cube,  resulting  in  the  output  matrix  being  distributed  across  a  third  face  of  the  cube. 

The  communication  cost  of  this  algorithm  is  the  cost  of  the  two  broadcasts  and  one  reduction.  The  size 
of  the  local  data  in  each  of  these  operations  is  n2/P2//3,  and  the  number  of  processors  involved  is  P1/3, 
so  the  total  bandwidth  cost  is  0(n2/P2/3)  and  the  total  latency  cost  is  O(logP).  These  communication 
costs  are  less  than  the  costs  of  2D  algorithms  for  dense  multiplication  [1,  11,  26].  However,  because  the 
local  computation  involves  matrices  of  size  n2/P2/3,  the  3D  algorithm  requires  more  local  memory  that  is 
necessary  to  store  the  input  and  output  matrices. 

This  tradeoff  between  memory  requirements  and  communication  costs  can  be  managed  in  a  continuous 
way  by  varying  the  dimensions  of  the  processor  grid  (or,  equivalently,  the  dimensions  of  the  bricks  of  voxels 
assigned  to  processors)  [21,  24].  Instead  of  using  a  cubic  \fP  x  \fP  x  \/P  processor  grid,  we  can  use  a 
c  x  y/P/c  x  y/P/c  grid,  where  1  <  c  <  \[P  and  c  =  1  reproduces  a  2D  algorithm.  The  approach  that 
generalizes  Cannon’s  algorithm  [1 1]  is  presented  as  ‘^AD-matrix-multiply”1  as  Algorithm  2  by  Solomonik 
and  Dernmel  [2  ]  and  the  approach  that  generalizes  SUMMA  is  presented  as  “2.5D-SUMMA”  in  Algorithm 
1  by  Solomonik  et  al.  [25].  Both  approaches  yields  a  bandwidth  cost  of  0(n2 /\/~Pc),  a  latency  cost  of 
0(V P/c3  +  logc),  and  local  memory  requirements  of  0(cn2 /P). 

4.3.2  Converting  to  Sparse  Case 

Naive  3D  algorithms  for  sparse  matrix  multiplication  can  be  devised  directly  from  the  dense  versions.  As 
in  [24,  25],  we  assume  the  data  initially  resides  only  on  the  one  of  the  c  layers  and  gets  replicated  along 
the  third  dimension  before  the  multiplications  start.  Then,  each  of  those  layers  executes  a  partial  2D  algo¬ 
rithm  (with  the  partial  contribution  to  C  remaining  stationary),  in  the  sense  that  each  layer  is  responsible  for 
computing  1/c  of  the  total  computation.  Consequently,  the  number  of  steps  in  the  main  stage  of  the  algo¬ 
rithm  becomes  yJP/c 3.  The  final  stage  of  the  algorithm  is  a  reduction  step  among  groups  of  c  processors, 
executed  concurrently  by  all  groups  of  processors  representing  a  fiber  along  the  third  processor  dimension. 

’The  origin  of  the  name  “2.5D”  comes  from  the  fact  that  the  algorithm  interpolates  between  existing  2D  and  3D  algorithms.  We 
use  the  term  3D  to  describe  both  3D  and  2.5D  dense  algorithms. 
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The  latency  cost  is  identical  to  the  dense  algorithm:  0(y/P/c3  +  logo).  The  first  term  comes  from  the 
main  stage  and  the  second  term  comes  from  the  initial  replication  and  final  reduction  phases.  The  bandwidth 
cost  can  be  computed  based  on  the  number  of  nonzeros  in  each  block  of  A,  B,  or  C  communicated.  In  the 
initial  replication  phase,  blocks  of  A  and  B  of  dimension  n  /y/Pfcxn/^PTc  are  broadcast  to  c  different 
processors  for  a  bandwidth  cost  of  0(cdn/P).  In  the  main  stage  of  the  algorithm,  the  same  size  blocks 
are  communicated  during  each  of  the  yj P/c 3  steps  for  a  total  bandwidth  cost  of  0(dn  /yfPc).  The  final 
reduction  is  significantly  different  from  a  dense  reduction,  resembling  more  closely  a  gather  operation  since 
the  expected  number  of  collisions  in  partial  contributions  of  C  is  very  small  for  d  <C  y/n.  Thus,  we  expect 
the  size  of  the  output  to  be  almost  as  large  as  the  sum  of  the  sizes  of  the  inputs.  The  bandwidth  cost  of  the 
final  phase  is  then  0(cd2n/ P). 

Thus,  the  straightforward  conversion  of  the  dense  3D  algorithm  to  the  sparse  case  results  in  the  same 
latency  cost  and  a  total  bandwidth  cost  of  0(dn/ y/Pc  +  cd2n/ P ).  Further,  this  algorithm  will  require  extra 
local  memory,  because  gathering  the  output  matrix  onto  one  layer  of  processors  requires  f l(cd2n/P)  words 
of  memory,  a  factor  of  D(c)  times  as  much  as  required  to  store  C  across  all  processors.  The  extra  space 
required  for  C  dominates  the  space  required  for  replication  of  A  and  B. 

4.3.3  Removing  Input  Replication  and  Assumption  on  Initial  Data  Distribution 

In  developing  a  more  efficient  3D  algorithm  for  the  sparse  case,  our  first  observation  is  that  we  can  avoid  the 
first  phase  of  input  replication.  This  replication  can  also  be  avoided  in  the  dense  case,  but  it  will  not  affect 
the  asymptotic  communication  costs. 

The  dense  2.5D  algorithms  assume  that  the  input  matrices  initially  reside  on  one  y/P/c  X  yJP/c  face  of 
the  processor  grid,  and  the  first  phase  of  the  algorithm  involves  replicating  A  and  B  to  each  of  the  c  layers. 
One  can  view  the  distribution  of  computation  as  assigning  1  / cth  of  the  outer  products  of  columns  of  A  with 
corresponding  rows  of  B  to  each  of  the  c  layers.  In  this  way,  each  layer  of  processors  needs  only  l/cth  of 
the  columns  of  A  and  rows  of  B  rather  than  the  entire  matrices. 

In  order  to  redistribute  the  matrices  across  c  sets  of  processors  in  a  2D  blocked  layout  with  block  size 
n/y/Pj  cxn/ y/ P/c,  blocks  of  y/c  x  y/c  processors  can  perform  all-to-all  operations,  as  shown  in  Figure  4. 
The  cost  of  this  operation  is  W  =  0(dn/P logo)  and  S  =  O(logc)  if  the  bit-fixing  algorithm  is  used, 
removing  the  initial  replication  cost  from  Section  4.3.2.  This  optimization  also  removes  the  extra  memory 
requirement  for  storing  copies  of  A  and  B. 

We  will  see  in  Section  4.3.4  that  the  output  matrix  can  be  returned  in  the  same  2D  blocked  layout  as  the 
input  matrices  were  initially  distributed. 

4.3.4  Improving  Communication  of  C 

Our  next  observation  for  the  sparse  case  is  that  the  final  reduction  phase  to  compute  the  output  matrix 
becomes  a  gather  rather  than  a  reduction.  This  gather  operation  collects  C  onto  one  layer  of  processors;  in 
order  to  balance  the  output  across  all  processors,  we  would  like  to  scatter  C  back  along  the  third  processor 
dimension.  However,  performing  a  gather  followed  by  a  scatter  is  just  an  inefficient  means  of  performing 
an  all-to-all  collective.  Thus  we  should  replace  the  final  reduction  phase  with  a  final  all-to-all  phase.  This 
optimization  reduces  the  bandwidth  cost  of  the  3D  algorithm  to  0(dn/V~Pc  +  d2n/P).  Note  that  the  cost 
of  replicating  A  and  B  in  the  first  phase  of  the  algorithm  would  no  longer  always  be  dominated  by  the 
reduction  cost  of  C,  as  in  Section  4.3.2,  but  the  cost  of  the  input  all-to-all  from  Section  4.3.3  is  dominated 
by  the  output  all-to-all.  By  replacing  the  reduction  phase  with  an  all-to-all,  we  also  remove  the  memory 
requirement  of  Q(cd2n/ P). 


4.3.5  Improving  Communication  of  A  and  B 

Furthermore,  we  can  apply  the  optimization  described  in  Section  4.2.2:  to  reduce  latency  costs  in  the  main 
phase  of  the  3D  algorithm  (which  itself  is  a  2D  algorithm),  processors  can  collect  all  the  entries  of  A 
and  B  they  need  upfront  rather  than  over  several  steps.  This  collective  operation  consists  of  groups  of 
\JP j(?  processors  performing  allgather  operations  (after  the  initial  circular  shifts  of  Cannon’s  algorithm, 
for  example).  Since  the  data  per  processor  in  the  allgather  operation  is  0(cdn/P),  the  bandwidth  cost  of 
the  main  phase  remains  0(dn/ y/~Pc).  The  latency  cost  is  reduced  from  0(y/P/c3)  to  0(log(^/P/c3)), 
yielding  a  total  latency  cost  (assuming  the  bit-fixing  algorithm  is  used  for  the  all-to-all)  of  O(logP).  The 
local  memory  requirements  increase  to  Q(dn/y/~Pc)',  when  c  >  P/d1 2,  this  requirement  is  no  more  than  the 
space  required  to  store  C. 

4.3.6  Optimizing  c 

If  d  >  vP,  then  d2n/P  >  dn/yfP,  and  the  communication  lower  bound  from  Section  3  is  fi(dn/\/P). 
Thus,  choosing  c  =  1  eliminates  the  d2n/P  logc  term,  and  the  3D  algorithm  reduces  to  a  2D  algorithm 
which  is  communication  optimal. 

However,  in  the  case  d  <  y/P,  which  will  become  the  case  in  a  strong-scaling  regime,  increasing  c  can 
reduce  communication.  In  this  case,  the  lower  bound  from  Section  3  is  dl(d2n/ P).  Depending  on  the  all- 
to-all  algorithm  used,  increasing  c  causes  slow  increases  on  latency  costs  and  on  the  d2n/P  log  c  bandwidth 
cost  term,  but  it  causes  more  rapid  decrease  in  the  dn/yfPc  term.  Choosing  c  =  Q(P/d2)  balances  the  two 
terms  in  the  bandwidth  cost,  yielding  a  total  bandwidth  cost  of  0(d2n/P),  which  attains  the  lower  bound  in 
this  case. 

In  summary,  choosing  c  =  min  {  1 .  P/d2  }  allows  for  a  communication  optimal  3D  sparse  matrix  mul¬ 
tiplication  algorithm,  with  a  slight  tradeoff  between  bandwidth  and  latency  costs  based  on  the  all-to-all 
algorithm  used.  Additionally,  making  this  choice  of  c  means  that  asymptotically  no  extra  memory  is  needed 
over  the  space  required  to  store  C. 

4.4  3D  Recursive  Algorithm 

We  also  present  a  new  3D  recursive  algorithm  which  is  a  parallelization  of  a  sequential  recursive  algorithm 
using  the  techniques  of  [3,  13].  Although  we  have  assumed  that  the  input  matrices  are  square,  the  recursive 
algorithm  will  use  rectangular  matrices  for  subproblems.  Assume  that  P  processors  are  solving  a  subprob¬ 
lem  of  size  m  x  k  X  m,  that  is  A  is  m  x  k,  and  B  is  k  x  m,  and  C  is  m  X  m.  We  will  split  into  four 
subproblems,  and  then  solve  each  subproblem  independently  on  a  quarter  of  the  processor.  There  are  two 
natural  ways  to  split  the  problem  into  four  equal  subproblems  that  respect  the  density  similarity  between  A 
and  B,  see  Figure  5. 

1.  Split  m  in  half,  creating  four  subproblems  of  shape  (m/2)  x  k  x  (m/2).  In  this  case  each  of  the  four 
subproblems  needs  access  to  a  different  part  of  C,  so  no  communication  of  C  is  needed.  However  one 
half  of  A  and  B  is  needed  for  each  subproblem,  and  since  each  quarter  of  the  processors  holds  only 
one  quarter  of  each  matrix,  it  will  be  necessary  to  replicate  A  and  B.  This  can  be  done  via  allgather 
collectives  among  disjoint  pairs  of  processors  at  the  cost  of  O  (dmk / (nP))  words  and  0(1)  messages. 

2.  Split  k  in  quarters,  creating  four  subproblems  of  shape  in  x  (k/4)  x  m.  In  this  case  each  of  the  four 
subproblems  needs  access  to  a  different  part  of  A  and  B,  so  with  the  right  data  layout,  no  communication 

of  A  or  B  is  needed.  However  each  subproblem  will  compute  nonzeros  across  all  of  C,  so  those  entries 
need  to  be  redistributed  and  combined  if  necessary.  This  can  be  done  via  all-to-all  collective  among 
disjoint  sets  of  4  processors  at  a  cost  of  0(d2m2/ ( nP ))  words  and  0(1)  messages. 
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At  each  recursive  step,  the  algorithm  chooses  whichever  split  is  cheapest  in  terms  of  communication 
cost.  Initially,  m  =  k  =  n  so  split  1  costs  0(dn/P)  words  and  is  cheaper  than  split  2,  which  costs 
0(d2n/P)  words.  There  are  two  cases  to  consider. 

Case  1:  If  P  <  d2,  the  algorithm  reaches  a  single  processor  before  split  1  becomes  more  expensive  than 
split  2,  so  only  split  1  is  used.  This  case  corresponds  to  a  2D  algorithm,  and  the  communication  costs  are 


log4  P-1 

W=  Y,  o 

i= 0 


/  d(n/2l)n\ 

V  P/*  J 


S  =  O(logP). 


Case  2:  If  P  >  d2,  split  1  becomes  more  expensive  than  split  2  after  log2  d  steps.  After  log2  d  steps, 
the  subproblems  have  dimensions  (n/d)  x  n  X  (n/d)  and  there  arc  P/d2  processors  working  on  each 
subproblem.  The  first  log2  d  steps  are  split  1,  and  the  rest  are  split  2,  giving  communication  costs  of 


log2  rf —  1 

w=  y  0 

i=0 


/  d(n/2l)n\ 

V  P/  4l  ) 


log4P 

+  E  ° 

*=log2  d 


o 


S  =  O(logP). 


This  case  corresponds  to  a  3D  algorithm. 

In  both  cases,  the  communication  costs  match  the  lower  bound  from  Section  3  up  to  factors  of  at  most 
log  P.  Only  layouts  that  are  compatible  with  the  recursive  structure  of  the  algorithm  will  allow  these  com¬ 
munication  costs.  One  simple  layout  is  to  have  A  is  block-column  layout,  B  in  block-row  layout.  Then  C 
should  have  blocks  of  size  n/d  x  n/d,  each  distributed  on  a  different  \P/d 2]  of  the  processors. 


5  Related  Work 

The  classical  serial  algorithm  of  Gustavson  [18],  which  is  the  algorithm  currently  implemented  in  Mat- 
LAB  [14],  does  optimal  work  for  the  common  case  of  flops  <C  nnz,  n.  Yuster  and  Zwick  [29]  gave  a 
O(nnz0'7  n1'2  +  v2+o{  1  ) )  time  serial  algorithm  for  multiplying  matrices  over  a  ring,  which  uses  Strassen- 
like  fast  dense  matrix  multiplication  as  a  subroutine.  Their  algorithm  is  theoretically  close  to  optimal  for  the 
case  of  nnz(C)  =  0(n2),  an  assumption  that  does  not  always  hold. 

The  1 D  improved  block-row  algorithm  is  due  to  Challacombe  [12],  who  calls  the  calculation  of  required 
indices  of  B  the  “symbolic”  phase.  His  algorithm  uses  the  allgather  collective  for  the  symbolic  phase  and 
point-to-point  communication  for  the  subsequent  numerical  phase.  Challacombe,  however,  did  not  analyze 
his  algorithm's  communication  costs.  Kruskal  et  al.  [19]  gave  a  parallel  algorithm  based  on  outer  products, 
which  runs  in  time  0((flops/p)  log  n/  log(flops/p))  on  p  processors.  They  use  the  EREW  PRAM  model, 
and  hence  do  not  analyze  the  communication  costs  of  their  algorithm. 

Sparse  2D  SUMMA  and  its  analysis  is  due  to  Bui  119  and  Gilbert  [8],  who  also  analyzed  the  ID  naive 
block-row  algorithm.  Their  follow-up  work  showed  that  SpSUMMA  provides  good  speedup  to  thousands  of 
cores  on  various  different  input  types,  but  its  scaling  is  limited  by  the  communication  costs  that  consume  the 
majority  of  the  time  [9].  Recent  work  by  Campagna  et  al.  [10]  sketches  a  parallel  algorithm  that  replicates 
the  inputs  (but  not  the  output)  to  all  the  processors  to  avoid  later  communication.  In  our  model,  their 
algorithm  has  bandwidth  cost  W  =  0(dn). 

Grigori  et  al.  [  1 6]  gave  tight  communication  lower  and  upper  bounds  for  Cholesky  factorization  of  sparse 
matrices  corresponding  to  certain  grids.  Pietracaprina  et  al.  [23]  gave  lower  bounds  on  the  number  of  rounds 
it  takes  to  compute  the  sparse  matrix  product  in  MapReduce.  Their  lower  bound  analysis,  however,  is  not 
parametrized  to  the  density  of  the  inputs  and  uses  the  inequality  flops  <  nnz  •  min (nnz,  n).  While  it  is  true 
that  there  exist  assignment  of  input  matrices  for  which  the  inequality  is  tight,  the  lower  bound  does  not  hold 
for  the  majority  of  input  matrix  pairs  for  which  the  inequality  is  not  tight.  By  parametrizing  the  density  of 
inputs,  we  show  that  our  algorithms  are  communication  optimal  over  all  ER(d)  matrices. 
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A  Tables  and  Figures 


Algorithm 

Bandwidth  cost 

Latency  cost 

Previous  Lower  Bound  [2] 

d2n  /  /  d2n 

pVm  -  V  P 

0 

Lower  Bound  [here] 

-(1^1 

1 

Naive  Block  Row  [8] 

dn 

P 

ID 

Improved  Block  Row*  [  2] 

d2n 

P 

min  (log  P, 

Outer  Product*  [  1 9] 

logP 

2D 

SpSUMMA  [8] 

dn 

Up 

VP 

Improved  SpSUMMA  [here] 

dn 

Up 

3D 

Iterative*  [here] 

log  P 

Recursive  [here] 

log  P 

Table  1:  Asymptotic  expected  communication  costs  of  sparsity-independent  algorithms.  Algorithms  marked  with  an 
asterisk  make  use  of  all-to-all  communication.  Depending  on  the  algorithm  used  for  the  all-to-all,  either  the  bandwidth 
or  latency  cost  listed  is  attainable,  but  not  both;  see  Section  2.2. 


Figure  1:  The  computation  cube  for  matrix  multiplication,  with  a  specified  subset  of  voxels  V  along  with  its  three 
projections.  Each  voxel  corresponds  to  the  multiplication  of  its  projection  onto  A  and  B,  and  contributes  to  its 
projection  onto  C. 
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Figure  2:  How  the  cube  is  partitioned  in  ID  (top),  2D  (middle),  and  3D  (bottom)  algorithms. 
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Figure  4:  Possible  redistribution  scheme  for  input  and  output  matrices  for  the  3D  algorithm  with  4x2x2  processor  grid 
(c  =  4).  The  colored  regions  denote  submatrices  owned  by  a  particular  processor.  The  input  matrices  are  initially  in  a 
2D  block  distribution,  and  redistribution  occurs  in  all-to-all  collectives  among  disjoint  sets  of  4  processors.  Since  each 
of  the  c  layers  are  2x2  grids,  the  intermediate  phase  consists  of  allgather  collectives  among  pairs  of  processors.  After 
local  computation,  the  output  matrix  is  redistributed  (and  nonzeros  combined  if  necessary)  via  all-to-all  collectives 
among  the  same  disjoint  sets  of  4  processors,  returning  the  output  matrix  also  in  a  2D  block  distribution. 
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Figure  5:  Two  ways  to  split  the  matrix  multiplication  into  four  subproblems,  with  the  parts  of  each  matrix  required  by 
each  subproblem  labelled.  On  the  left  is  split  1  and  on  the  right  is  split  2. 
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